Data integration and linking (with survey data)

Advanced Geospatial Data Processing for Social Scientists

Dennis Abel & Stefan Jünger

2025-04-29

Day Time Title
April 28 10:00-11:15 Introduction
April 28 11:15-11:30 Coffee Break
April 28 11:30-13:00 Raster data in R
April 28 13:00-14:00 Lunch Break
April 28 14:00-15:15 Raster data processing
April 28 15:15-15:30 Coffee Break
April 28 15:30-17:00 Graphical display of raster data in maps
April 29 10:00-11:15 Datacube processing I
April 29 11:15-11:30 Coffee Break
April 29 11:30-13:00 Datacube processing II & API access
April 29 13:00-14:00 Lunch Break
April 29 14:00-15:15 Data integration and linking (with survey data)
April 29 15:15-15:30 Coffee Break
April 29 15:30-17:00 Outlook and open session with own application

Survey data: ISSP Environment (1993-2020)

After downloading the ISSP data file from the website, we place it in our ./data folder. We have already prepared a script in the folder ./R called prep_issp.R to prepare the data, which we call using source().

source("./R/prep_issp.R")

head(issp, 10)
   year   country concern
1  1993 Australia       4
2  1993 Australia       4
3  1993 Australia       5
4  1993 Australia       3
5  1993 Australia       3
6  1993 Australia       3
7  1993 Australia       3
8  1993 Australia       4
9  1993 Australia       3
10 1993 Australia       3

Climate concern in 2020

There’s a useful item in the ISSP asking respondents to evaluate whether

“A rise in world’s temperature is (dangerous/ not dangerous) for environment”

EO indicators

We want to investigate whether temperature anomalies in the year of the survey, compared to a long-running average, are associated with climate change concerns.

  1. Indicator: Temperature - annual average
  2. Intensity: Anomaly (mean deviation)
  3. Focal time period: 1993, 2000, 2010, 2020
  4. Baseline period: 1961-1990
  5. Spatial buffer: Country
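The anomaly in step 2 boils down to a simple difference: the focal-year mean minus the long-running baseline average. A minimal base-R sketch with hypothetical temperature values (in Kelvin):

```r
# Hypothetical annual mean temperatures (K) for the baseline period
baseline_means <- c(287.1, 287.3, 287.0)

# Hypothetical focal-year mean (K)
focal_2020 <- 288.1

# Anomaly: focal-year mean minus the baseline average
anomaly <- focal_2020 - mean(baseline_means)
```

The rest of this session builds exactly this quantity, only with real ERA5 values aggregated to the country level.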

ERA5 data

The ERA5-Land Reanalysis from the Copernicus Climate Change Service is a suitable data product for this temperature indicator. It records observations on air temperature at 2 meters above the surface from 1950 onwards, has a spatial resolution of 0.1x0.1 degrees, and has global spatial coverage. We can apply our conceptualization exactly to these data.

Data access and preparation

To access the data, we need an ECMWF account. Utilizing the ecmwfr package, we can access the data directly in R. Given that we want to aggregate the data at the country level, we first load country vector data and download the data according to the spatial extent of the countries included in the survey. The ISSP has a diverse membership from North and South America, Europe, Africa, and Asia. Thus, we can work with a global spatial extent when downloading the EO indicator.

We also need several packages to load and prepare the world map and to process the raster files (rnaturalearth, sf, terra, and tidyverse), as well as the keyring package to safely store our ECMWF API key.

required_packages <- 
  c("keyring", "rnaturalearth", "sf", "tidyverse", "terra", "devtools")

new_packages <- 
  required_packages[!(required_packages %in% installed.packages()[,"Package"])]

if(length(new_packages)) install.packages(new_packages)

lapply(required_packages, library, character.only = TRUE)

Country level shapefile

We load the vector data containing country-level polygons and subset it to the most relevant variables.

world <- 
  rnaturalearth::ne_countries(
    scale = "medium", 
    returnclass = "sf"
  ) |> 
  dplyr::select(
    admin, 
    iso_a3, 
    geometry
  )

plot(sf::st_geometry(world))

Accessing the Copernicus API

A final step before we can access the data from the Copernicus API is to store our API key. ecmwfr later retrieves the key automatically from the keyring entry stored under the service name "wf_api_key".

# Store as environment variable (replace with your personal key)
Sys.setenv(WF_API_KEY = "MY-API-KEY")

api_key <- Sys.getenv("WF_API_KEY")

# Store the key in the system keyring under the service name "wf_api_key"
keyring::key_set_with_value(service = "wf_api_key", password = api_key)

Now we can access the data. We loop the download over the four years of the survey program (1993, 2000, 2010, 2020) to create four separate files.

Downloading the data

# API access looped over four years
for (yr in c("1993", "2000", "2010", "2020")) {
  
  # Create file names which include year
  file_name <- paste0("era5_temperature", yr, ".grib")
  
  # Specify API request
  request <- 
    list(
      data_format = "grib",
      variable = "2m_temperature",
      product_type = "monthly_averaged_reanalysis",
      time = "00:00",
      year = yr,
      month = 
        c(
          "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"
        ),
      area = c(90, -180, -90, 180),
      dataset_short_name = "reanalysis-era5-land-monthly-means",
      target = file_name
    )
  
  # Download data from C3S
  file_path <- 
    ecmwfr::wf_request(
      request = request,
      transfer = TRUE,
      path = "./data/EO_data/C3S_data",
      verbose = FALSE
    )
}

Importing the data

Importing the data into R should feel natural to us by now. We simply use terra::rast() for this endeavour.

temp_1993 <- terra::rast("./data/EO_data/C3S_data/era5_temperature1993.grib")
temp_2000 <- terra::rast("./data/EO_data/C3S_data/era5_temperature2000.grib")
temp_2010 <- terra::rast("./data/EO_data/C3S_data/era5_temperature2010.grib")
temp_2020 <- terra::rast("./data/EO_data/C3S_data/era5_temperature2020.grib")
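Four separate objects work for four years, but the same import can also be sketched as a loop that collects the rasters in a named list. The paths mirror the file names from the download loop above; as a precaution, terra::rast() is only attempted for files that actually exist on disk:

```r
years <- c("1993", "2000", "2010", "2020")
paths <- paste0("./data/EO_data/C3S_data/era5_temperature", years, ".grib")
names(paths) <- paste0("temp_", years)

# Read every file that is present into a named list of SpatRasters
existing <- paths[file.exists(paths)]
temp_list <- if (length(existing)) lapply(existing, terra::rast) else list()
```

With the rasters in a list, later loops can iterate over names(temp_list) instead of reconstructing object names with get().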

Inspecting the data

Let’s inspect the datacube for 2020 and plot the first layer of the 2020 datacube (January 2020). The file’s attributes tell us information on the dimensions (number of rows, columns, and layers), the resolution, spatial extent, the coordinate reference system, units, and time points.

temp_2020
class       : SpatRaster 
dimensions  : 1801, 3600, 12  (nrow, ncol, nlyr)
resolution  : 0.1, 0.1  (x, y)
extent      : -180.05, 179.95, -90.05, 90.05  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat Coordinate System imported from GRIB file 
source      : era5_temperature2020.grib 
names       : SFC (~e [C], SFC (~e [C], SFC (~e [C], SFC (~e [C], SFC (~e [C], SFC (~e [C], ... 
unit        :           C,           C,           C,           C,           C,           C, ... 
time        : 2020-01-01 01:00:00 to 2020-12-01 01:00:00 (2 steps) UTC 
plot(temp_2020[[1]])

Aggregating to the country level

Now, we can aggregate the monthly values by year and country. Before extracting, we check that our country polygons and the raster files share the same CRS, reprojecting the polygons if necessary.

for (yr in c("1993", "2000", "2010", "2020")) {
  temp_data <- get(paste0("temp_", yr))
  
  # Check CRS of both datasets and adjust if necessary
  if(!identical(terra::crs(world), terra::crs(temp_data))) {
    world <- 
      world |>
      sf::st_transform(sf::st_crs(temp_data))
  }
  
  # Collapse the month layers into one layer by averaging across months
  annual_values <- terra::app(temp_data, fun = mean, na.rm = TRUE, cores = 4)
  
  # Aggregate by country
  country_values <- 
    terra::extract(
      annual_values,
      world,
      fun = mean,
      na.rm = TRUE
    )
  
  # Add values to shapefile
  world[paste0("temp_", yr)] <- country_values[, 2]
}

The result

We now have our country polygon vector file with yearly mean temperatures for each survey year.

head(world, 2)
Simple feature collection with 2 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 21.97891 ymin: -22.40205 xmax: 33.66152 ymax: -8.193652
Geodetic CRS:  Coordinate System imported from GRIB file
     admin iso_a3                       geometry temp_1993 temp_2000 temp_2010
1 Zimbabwe    ZWE MULTIPOLYGON (((31.28789 -2...  294.3789  293.3119  294.4570
2   Zambia    ZMB MULTIPOLYGON (((30.39609 -1...  294.8202  294.7319  295.1083
  temp_2020
1  294.6193
2  295.2352
plot(world["temp_2020"])

Gathering the baseline data

Now that we have the focal values for all four survey years, we redo the process for the baseline period (1961-1990).

# Specify API request
request <- 
  list(
    data_format = "grib",
    variable = "2m_temperature",
    product_type = "monthly_averaged_reanalysis",
    time = "00:00",
    year = as.character(1961:1990),
    month = 
      c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"),
    area = c(90, -180, -90, 180),
    dataset_short_name = "reanalysis-era5-land-monthly-means",
    target = "era5_temperature1961-1990.grib"
  )

# Download data from C3S
file_path <- 
  ecmwfr::wf_request(
    request = request,
    transfer = TRUE,
    path = "./data/EO_data/C3S_data",
    verbose = FALSE
  )

Importing the data

Again, we use terra::rast() to import the data.

temp_base <- 
  terra::rast("./data/EO_data/C3S_data/era5_temperature1961-1990.grib")

Aggregating to the country level

We also aggregate these data at the country level and add them to our country polygon vector data.

# Check CRS of both datasets and adjust if necessary
if(!identical(terra::crs(world), terra::crs(temp_base))) {
  world <- 
    world |>
    sf::st_transform(crs = sf::st_crs(temp_base))
}

# Collapse all into one layer by averaging across months and years
annual_values <- terra::app(temp_base, fun = mean, na.rm = TRUE, cores = 4)

# Aggregate by country
country_values <- 
  terra::extract(
    annual_values,
    world,
    fun = mean,
    na.rm = TRUE
  )

# Add values to vector data
world$temp_base <- country_values[, 2]

The result

head(world)
Simple feature collection with 6 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -73.36621 ymin: -22.40205 xmax: 109.4449 ymax: 41.9062
Geodetic CRS:  Coordinate System imported from GRIB file
      admin iso_a3                       geometry temp_1993 temp_2000 temp_2010
1  Zimbabwe    ZWE MULTIPOLYGON (((31.28789 -2...  294.3789  293.3119  294.4570
2    Zambia    ZMB MULTIPOLYGON (((30.39609 -1...  294.8202  294.7319  295.1083
3     Yemen    YEM MULTIPOLYGON (((53.08564 16...  297.5421  297.9850  298.3186
4   Vietnam    VNM MULTIPOLYGON (((104.064 10....  295.9980  295.8958  296.7712
5 Venezuela    VEN MULTIPOLYGON (((-60.82119 9...  297.7739  297.5062  298.3977
6   Vatican    VAT MULTIPOLYGON (((12.43916 41...  288.3732  289.0528  288.5110
  temp_2020 temp_base
1  294.6193  294.1319
2  295.2352  294.8179
3  298.2873  297.3228
4  296.7613  295.6958
5  298.7989  297.8074
6  289.6033  288.1982

Calculating deviations

Now that we have the focal and baseline values, we calculate single deviations.

world <- 
  world |>
  dplyr::mutate(
    diff_1993 = temp_1993 - temp_base,
    diff_2000 = temp_2000 - temp_base,
    diff_2010 = temp_2010 - temp_base,
    diff_2020 = temp_2020 - temp_base
  )

# Plot 2020 deviation from baseline
ggplot(data = world) +
  geom_sf(aes(fill = diff_2020)) +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(
    title = 
      "Absolute deviation between 2020 and baseline temperature",
    subtitle = "Averaged across countries",
    fill = "Temperature (K)"
  )

Let’s use these data

Remember, our question was whether temperature deviations from a baseline period (1961-1990) in the survey year correspond with the overall climate concern within a country. We can address this question, e.g., by aggregating the survey data as follows. Note that we create a within-country z-standardized measure of climate concern along the way.

issp_aggregated <-
  issp |> 
  dplyr::mutate(
    country = 
      ifelse(
        country == "USA", 
        "United States of America", 
        country
      )
  ) |> 
  dplyr::group_by(country) |> 
  dplyr::mutate(concern = scale(concern)) |> 
  dplyr::group_by(country, year) |> 
  dplyr::summarize(
    concern = mean(concern, na.rm = TRUE), 
    .groups = "drop"
  )

issp_aggregated
# A tibble: 102 × 3
   country    year  concern
   <chr>     <dbl>    <dbl>
 1 Australia  1993  0.140  
 2 Australia  2010 -0.263  
 3 Australia  2020  0.227  
 4 Austria    2000 -0.0158 
 5 Austria    2010 -0.225  
 6 Austria    2020  0.187  
 7 Bulgaria   1993  0.0797 
 8 Bulgaria   2000 -0.0766 
 9 Bulgaria   2010 -0.00632
10 Canada     1993  0.0335 
# ℹ 92 more rows
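What scale() does inside each country group: it centers the values at the group mean and divides by the group standard deviation. A minimal sketch with hypothetical concern scores:

```r
# Hypothetical concern scores for one country
concern <- c(3, 4, 5, 3, 4)

# Manual z-standardization: center at the mean, divide by the SD
z <- (concern - mean(concern)) / sd(concern)

# scale() produces the same values (as a one-column matrix)
all.equal(z, as.numeric(scale(concern)))  # TRUE
```

Standardizing within countries makes the concern means comparable across countries that use the response scale differently.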

Structurally harmonizing the world data

Formally, the ISSP data are now in an aggregated long format. We have to wrangle our world data, including the temperature measures, to harmonize both datasets. Note that we z-standardize the temperature differences across the whole period and all countries this time, as they do not vary within a country in a given year.

world_restructured <-
  world |> 
  sf::st_drop_geometry() |> 
  dplyr::select(
    country = admin, 
    dplyr::contains("diff_")
  ) |> 
  tidyr::pivot_longer(
    cols = dplyr::contains("diff_"),
    values_to = "temp_diff",
    names_to = "diff"
  ) |> 
  dplyr::mutate(
    year = rep(c(1993, 2000, 2010, 2020), 242)
  ) |> 
  dplyr::mutate(temp_diff = scale(temp_diff))

world_restructured
# A tibble: 968 × 4
   country  diff      temp_diff[,1]  year
   <chr>    <chr>             <dbl> <dbl>
 1 Zimbabwe diff_1993      -0.549    1993
 2 Zimbabwe diff_2000      -1.95     2000
 3 Zimbabwe diff_2010      -0.447    2010
 4 Zimbabwe diff_2020      -0.233    2020
 5 Zambia   diff_1993      -0.871    1993
 6 Zambia   diff_2000      -0.987    2000
 7 Zambia   diff_2010      -0.492    2010
 8 Zambia   diff_2020      -0.325    2020
 9 Yemen    diff_1993      -0.586    1993
10 Yemen    diff_2000      -0.00337  2000
# ℹ 958 more rows
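A caveat on the rep(c(1993, 2000, 2010, 2020), 242) call above: it assumes exactly 242 countries and a stable row order after pivot_longer(). A more defensive variant (a sketch) parses the year directly out of the diff label, which works for any number of rows in any order:

```r
# Derive the year from the "diff_YYYY" label instead of relying on row order
diff_labels <- c("diff_1993", "diff_2000", "diff_2010", "diff_2020")
years <- as.numeric(sub("^diff_", "", diff_labels))
years  # 1993 2000 2010 2020
```

Inside the pipeline, this could replace the rep() line with dplyr::mutate(year = as.numeric(sub("^diff_", "", diff))).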

Linking the data

We can now link both datasets to obtain a final dataset we can analyze. We also convert it to a long format for plotting.

issp_linked <-
  dplyr::left_join(
    issp_aggregated, world_restructured, by = c("country", "year")
  ) |> 
  tidyr::pivot_longer(
    cols = c("concern", "temp_diff"), names_to = "variable"
  )

issp_linked
# A tibble: 204 × 5
   country    year diff      variable  value[,1]
   <chr>     <dbl> <chr>     <chr>         <dbl>
 1 Australia  1993 diff_1993 concern      0.140 
 2 Australia  1993 diff_1993 temp_diff   -1.03  
 3 Australia  2010 diff_2010 concern     -0.263 
 4 Australia  2010 diff_2010 temp_diff   -1.26  
 5 Australia  2020 diff_2020 concern      0.227 
 6 Australia  2020 diff_2020 temp_diff    0.0272
 7 Austria    2000 diff_2000 concern     -0.0158
 8 Austria    2000 diff_2000 temp_diff    1.39  
 9 Austria    2010 diff_2010 concern     -0.225 
10 Austria    2010 diff_2010 temp_diff   -0.586 
# ℹ 194 more rows

The analysis result